the number of insertions (I) or deletions (D) from the CIGAR strings in a BAM file. Since
the CIGAR field is the sixth column, we first use “samtools view -F 0x4
SRR769545_mem_sorted.bam” to extract the mapped records. We then send that output through
the pipe symbol “|” to “cut -f 6”, which isolates the sixth column. The result is piped to
“grep -P”, which selects only the strings that contain the character “D” or “I”, using the
character class “[ID]” to match either of the two. Next, the “tr” command deletes every
character other than “I” and “D”. Finally, “wc -c” counts the remaining characters, each of
which corresponds to one insertion or deletion operation in a CIGAR string:

samtools view \
    -F 0x4 SRR769545_mem_sorted.bam \
    | cut -f 6 \
    | grep -P '[ID]' \
    | tr -cd 'ID' \
    | wc -c
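The filter-and-count steps can be checked without a BAM file. The following is a minimal sketch that assumes a handful of made-up CIGAR strings in place of the “samtools view ... | cut -f 6” output; the variable names “cigars” and “total” are illustrative only. It uses plain “grep” rather than “grep -P”, since a simple character class does not need Perl-style patterns.

```shell
# Toy CIGAR strings (made up for illustration), one per line, standing in
# for the output of "samtools view -F 0x4 ... | cut -f 6".
cigars='50M
20M2I28M
10M1D15M3I22M
45M5S
30M4D20M'

# Keep only strings containing I or D, delete everything except those two
# letters (newlines included), and count the characters that remain.
# The final "tr -d ' '" strips the padding that some wc versions add.
total=$(printf '%s\n' "$cigars" | grep '[ID]' | tr -cd 'ID' | wc -c | tr -d ' ')
echo "$total"   # prints 4
```

The toy input contains two I operations and two D operations, so the count is 4. Because “tr -cd” already discards every other character, the “grep” step does not change the result; it only reduces the amount of text passed downstream.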

To count insertions and deletions separately, use the following, respectively:

samtools view \
    -F 0x4 SRR769545_mem_sorted.bam \
    | cut -f 6 \
    | grep -P 'I' \
    | tr -cd 'I' \
    | wc -c

samtools view \
    -F 0x4 SRR769545_mem_sorted.bam \
    | cut -f 6 \
    | grep -P 'D' \
    | tr -cd 'D' \
    | wc -c
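The pipelines above count CIGAR operations, not bases: “3I” contributes one character even though it inserts three bases. As a hedged sketch of how both numbers could be obtained in a single pass, the following awk program walks each CIGAR string one operation at a time. The toy input and the names “summary”, “ins_ops”, “ins_bases”, “del_ops”, and “del_bases” are assumptions for illustration; on real data the “printf” line would be replaced by the “samtools view ... | cut -f 6” part of the pipeline.

```shell
# Toy CIGAR strings (made up for illustration).
summary=$(printf '%s\n' '50M' '20M2I28M' '10M1D15M3I22M' '30M4D20M' |
awk '{
    s = $0
    # Consume the string one <length><operation> pair at a time.
    while (match(s, /[0-9]+[MIDNSHP=X]/)) {
        len = substr(s, RSTART, RLENGTH - 1) + 0      # numeric length
        op  = substr(s, RSTART + RLENGTH - 1, 1)      # operation letter
        if (op == "I") { ins_ops++; ins_bases += len }
        if (op == "D") { del_ops++; del_bases += len }
        s = substr(s, RSTART + RLENGTH)
    }
}
END {
    printf "ins_ops=%d ins_bases=%d del_ops=%d del_bases=%d\n",
           ins_ops, ins_bases, del_ops, del_bases
}')
echo "$summary"   # prints ins_ops=2 ins_bases=5 del_ops=2 del_bases=5
```

The operation counts match what the “tr -cd” and “wc -c” pipelines report, while the base counts add up the lengths in front of each I or D.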

Refer to Table 2.3 for the different FLAG values and descriptions.

2.4.1.6  Removing Duplicate Reads

Duplicate reads may be produced during library construction, PCR amplification (PCR
duplicates), or by a fault in the sequencer's optical sensor (optical duplicates). A large
number of duplicate reads originating from a single fragment may bias some applications,
such as RNA-Seq, in which the read count has a biological interpretation. The “samtools
rmdup” command can be used to remove potential duplicate reads from BAM/SAM files. If
multiple reads have identical coordinates, only the read (or the read pair, if paired end)
with the highest mapping quality is retained. By default, this command works for paired-end
reads; the option “-s” is used if the reads are single end. Note that recent samtools
releases mark “rmdup” as obsolete and recommend “samtools markdup” instead.
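The retention rule can be illustrated on toy data. This is only a sketch of the idea, not the samtools implementation: it assumes a hypothetical tab-separated layout (chromosome, position, strand, mapping quality, read name), treats reads with the same chromosome, position, and strand as duplicates, and keeps the one with the highest mapping quality. The variable name “kept” is illustrative.

```shell
# Hypothetical toy records: chrom, position, strand, MAPQ, read name.
tab=$(printf '\t')
kept=$(printf '%s\t%s\t%s\t%s\t%s\n' \
    chr1 100 + 60 readA \
    chr1 100 + 20 readB \
    chr1 100 - 40 readC \
    chr1 250 + 30 readD \
    chr1 250 + 37 readE |
# Sort by coordinate and strand, then by MAPQ from highest to lowest.
LC_ALL=C sort -t "$tab" -k1,1 -k2,2n -k3,3 -k4,4nr |
# Print only the first (highest-MAPQ) read seen for each coordinate key.
awk -F '\t' '!seen[$1 FS $2 FS $3]++ { print $5 }')
echo "$kept"
```

With this input the sketch keeps readA, readC, and readE, and drops readB and readD as lower-quality duplicates. The actual samtools command for the sorted BAM file follows.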

samtools rmdup \
    SRR769545_mem_sorted.bam \